Paper Reading
AI Chip/Accelerator
- Nature
- Neuro-inspired computing chips(2020)
- Illusion of large on-chip memory by networked computing chips for neural network inference(2021)
- Neuromorphic computing at scale(2025)
- Science
- Edge learning using a fully integrated neuro-inspired memristor chip(2023)
- ISSCC
- A 1.42TOPS/W Deep Convolutional Neural Network Recognition Processor for Intelligent IoE Systems(2016)
- A 288μW Programmable Deep-Learning Processor with 270KB On-Chip Weight Storage Using Non-Uniform Memory Hierarchy for Mobile Intelligence(2017)
- A 0.62mW Ultra-Low-Power Convolutional-Neural-Network Face-Recognition Processor and a CIS Integrated with Always-On Haar-Like Face Detector(2017)
- A 2.9TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems(2017)
- DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks(2017)
- UNPU: A 50.6TOPS/W Unified Deep Neural Network Accelerator with 1b-to-16b Fully-Variable Weight Bit-Precision(2018)
- A 65nm 4Kb Algorithm-Dependent Computing-in-Memory SRAM Unit-Macro with 2.3ns and 55.8TOPS/W Fully Parallel Product-Sum Operation for Binary DNN Edge Processors(2018)
- 7.6 A 65nm 236.5nJ/Classification Neuromorphic Processor with 7.5% Energy Overhead On-Chip Learning Using Direct Spike-Only Feedback(2019)
- A 28nm 64Kb 6T SRAM Computing-in-Memory Macro with 8b MAC Operation for AI Edge Chips(2020)
- AMD Chiplet Architecture for High-Performance Server and Desktop Products(2020)
- 15.3 A 351TOPS/W and 372.4GOPS Compute-in-Memory SRAM Macro in 7nm FinFET CMOS for Machine-Learning Applications(2020)
- A Programmable Neural-Network Inference Accelerator Based on Scalable In-Memory Computing(2021)
- An 89TOPS/W and 16.3TOPS/mm2 All-Digital SRAM-Based Full-Precision Compute-In-Memory Macro in 22nm for Machine-Learning Edge Applications(2021)
- A 5-nm 254-TOPS/W 221-TOPS/mm2 Fully-Digital Computing-in-Memory Macro Supporting Wide-Range Dynamic-Voltage-Frequency Scaling and Simultaneous MAC and Write Operations(2022)
- DIMC: 2219TOPS/W 2569F2/b Digital In-Memory Computing Macro in 28nm Based on Approximate Arithmetic Hardware(2022)
- A 4nm 6163-TOPS/W/b 4790-TOPS/mm2/b SRAM-Based Digital-Computing-in-Memory Macro Supporting Bit-Width Flexibility and Simultaneous MAC and Weight Update(2023)
- 34.4 A 3nm, 32.5TOPS/W, 55.0TOPS/mm2 and 3.78Mb/mm2 Fully-Digital Compute-in-Memory Macro Supporting INT12 × INT12 with a Parallel-MAC Architecture and Foundry 6T-SRAM Bit Cell(2024)
- JSSC
- Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks(2017)
- A High Energy Efficient Reconfigurable Hybrid Neural Network Processor for Deep Learning Applications(2018)
- Evolver: A Deep Learning Processor With On-Device Quantization–Voltage–Frequency Tuning(2021)
- TranCIM: Full-Digital Bitline-Transpose CIM-based Sparse Transformer Accelerator With Pipeline/Parallel Reconfigurable Modes(2023)
- ReDCIM: Reconfigurable Digital Computing-In-Memory Processor With Unified FP/INT Pipeline for Cloud AI Acceleration(2023)
- IEDM
- NeuroSim+: An integrated device-to-algorithm framework for benchmarking synaptic devices and array architectures(2017)
- DNN+NeuroSim: An End-to-End Benchmarking Framework for Compute-in-Memory Accelerators with Versatile Device Technologies(2019)
- AI Computing in Light of 2.5D Interconnect Roadmap: Big-Little Chiplets for In-memory Acceleration(2022)
- Design of Analog-AI Hardware Accelerators for Transformer-based Language Models(2023)
- VLSI
- A 1.06-to-5.09 TOPS/W reconfigurable hybrid-neural-network processor for deep learning applications(2017)
- A 40nm Analog-Input ADC-Free Compute-in-Memory RRAM Macro with Pulse-Width Modulation between Sub-arrays(2022)
- A 12nm 121-TOPS/W 41.6-TOPS/mm2 All Digital Full Precision SRAM-based Compute-in-Memory with Configurable Bit-width For AI Edge Applications(2022)
- A 12nm 137 TOPS/W Digital Compute-In-Memory using Foundry 8T SRAM Bitcell supporting 16 Kernel Weight Sets for AI Edge Applications(2023)
- ISCA
- ShiDianNao: shifting vision processing closer to the sensor(2015)
- ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars(2016)
- PRIME: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory(2016)
- EIE: efficient inference engine on compressed deep neural network(2016)
- In-Datacenter Performance Analysis of a Tensor Processing Unit(2017)
- SCALEDEEP: A Scalable Compute Architecture for Learning and Evaluating Deep Networks(2017)
- SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks(2017)
- SpinalFlow: An Architecture and Dataflow Tailored for Spiking Neural Networks(2020)
- ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks(2021)
- A software-defined tensor streaming multiprocessor for large-scale machine learning(2022)
- MICRO
- Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture(2019)
- Sanger: A Co-Design Framework for Enabling Sparse Attention using Reconfigurable Architecture(2021)
- Si-Kintsugi: Towards Recovering Golden-Like Performance of Defective Many-Core Spatial Architectures for AI(2023)
- SOFA: A Compute-Memory Optimized Sparsity Accelerator via Cross-Stage Coordinated Tiling(2024)
- ASPLOS
- DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning(2014)
- PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference(2019)
- DOTA: detect and omit weak attentions for scalable transformer acceleration(2022)
- TinyForge: A Design Space Exploration to Advance Energy and Silicon Area Trade-offs in tinyML Compute Architectures with Custom Latch Arrays(2024)
- DAC
- Atomlayer: a universal ReRAM-based CNN accelerator with atomic layer computation(2018)
- A Configurable Multi-Precision CNN Computing Framework Based on Single Bit RRAM(2019)
- A Two-way SRAM Array based Accelerator for Deep Neural Network On-chip Training(2020)
- HERO: hessian-enhanced robust optimization for unifying and improving generalization and quantization performance(2022)
- Chiplet actuary: a quantitative cost model and multi-chiplet architecture exploration(2022)
- PIMCOMP: A Universal Compilation Framework for Crossbar-based PIM DNN Accelerators(2023)
- AutoDCIM: An Automated Digital CIM Compiler(2023)
- A Convolution Neural Network Accelerator Design with Weight Mapping and Pipeline Optimization(2023)
- PIM-HLS: An Automatic Hardware Generation Tool for Heterogeneous Processing-In-Memory-based Neural Network Accelerators(2023)
- Chiplets: How Small is too Small?(2023)
- HexaMesh: Scaling to Hundreds of Chiplets with an Optimized Chiplet Arrangement(2023)
- HPCA
- PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning(2017)
- A3: Accelerating Attention Mechanisms in Neural Networks with Approximation(2020)
- SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning(2021)
- Ascend: a Scalable and Unified Architecture for Ubiquitous Deep Neural Network Computing(2021)
- MAGMA: An Optimization Framework for Mapping Multiple DNNs on Multiple Accelerator Cores(2022)
- TransPIM: A Memory-based Acceleration via Software-Hardware Co-Design for Transformer(2022)
- Gemini: Mapping and Architecture Co-exploration for Large-scale DNN Chiplet Accelerators(2024)
- Lightening-Transformer: A Dynamically-operated Optically-interconnected Photonic Transformer Accelerator(2024)
- Prosperity: Accelerating Spiking Neural Networks via Product Sparsity(2025)
- ISLPED
- A Fully-Integrated Energy-Scalable Transformer Accelerator Supporting Adaptive Model Configuration and Word Elimination for Language Understanding on Edge Devices(2023)
- ICCAD
- Scaling the “memory wall”(2012)
- OpenRAM: An open-source memory compiler(2016)
- Accelergy: An Architecture-Level Energy Estimation Methodology for Accelerator Designs(2019)
- MAGNet: A Modular Accelerator Generator for Neural Networks(2019)
- ReTransformer: ReRAM-based processing-in-memory architecture for transformer acceleration(2020)
- GAMMA: automating the HW mapping of DNN models on accelerators via genetic algorithm(2020)
- Multi-Objective Optimization of ReRAM Crossbars for Robust DNN Inferencing under Stochastic Noise(2021)
- Design Space and Memory Technology Co-Exploration for In-Memory Computing Based Machine Learning Accelerators(2022)
- Big-Little Chiplets for In-Memory Acceleration of DNNs: A Scalable Heterogeneous Architecture(2022)
- GLSVLSI
- Computing Utilization Enhancement for Chiplet-based Homogeneous Processing-in-Memory Deep Learning Processors(2021)
- DATE
- CACTI-3DD: Architecture-level modeling for 3D die-stacked DRAM main memory(2012)
- ReCom: An efficient resistive accelerator for compressed deep neural networks(2018)
- TDO-CIM: Transparent Detection and Offloading for Computation In-memory(2020)
- A Fast and Energy Efficient Computing-in-Memory Architecture for Few-Shot Learning Applications(2020)
- Shenjing: A low power reconfigurable neuromorphic accelerator with partial-sum and spike networks-on-chip(2020)
- A Runtime Reconfigurable Design of Compute-in-Memory based Hardware Accelerator(2021)
- In-Memory Computing based Accelerator for Transformer Networks for Long Sequences(2021)
- Gibbon: Efficient Co-Exploration of NN Model and Processing-In-Memory Architecture(2022)
- Achieving Datacenter-scale Performance through Chiplet-based Manycore Architectures(2023)
- SoftmAP: Software-Hardware Co-design for Integer-Only Softmax on Associative Processors(2024)
- ASP-DAC
- ReGAN: A pipelined ReRAM-based accelerator for generative adversarial networks(2018)
- Learning the sparsity for ReRAM: mapping and pruning sparse neural network for ReRAM based accelerator(2019)
- This is SPATEM! A Spatial-Temporal Optimization Framework for Efficient Inference on ReRAM-based CNN Accelerator(2022)
- Improving the Robustness and Efficiency of PIM-Based Architecture by SW/HW Co-Design(2023)
- A Low-Bitwidth Integer-STBP Algorithm for Efficient Training and Inference of Spiking Neural Networks(2023)
- MINT: Multiplier-less INTeger Quantization for Energy Efficient Spiking Neural Networks(2024)
- ISCAS
- Optimizing Weight Mapping and Data Flow for Convolutional Neural Networks on RRAM Based Processing-In-Memory Architecture(2019)
- MINT: Mixed-Precision RRAM-Based IN-Memory Training Architecture(2020)
- An 8T SRAM Based Digital Compute-In-Memory Macro For Multiply-And-Accumulate Accelerating(2023)
- MWSCAS
- 8T XNOR-SRAM based Parallel Compute-in-Memory for Deep Neural Network Accelerator(2020)
- Open-Source Memory Compiler for Automatic RRAM Generation and Verification(2021)
- TC
- CIMAT: A Compute-In-Memory Architecture for On-chip Training Based on Transpose SRAM Arrays(2020)
- Device-Circuit-Architecture Co-Exploration for Computing-in-Memory Neural Accelerators(2021)
- TCAS-I
- Research Progress on Memristor: From Synapses to Computing Systems(2022)
- ENNA: An Efficient Neural Network Accelerator Design Based on ADC-Free Compute-In-Memory Subarrays(2023)
- TCAD
- MNSIM: Simulation Platform for Memristor-Based Neuromorphic Computing System(2018)
- DNN+NeuroSim V2.0: An End-to-End Benchmarking Framework for Compute-in-Memory Accelerators for On-Chip Training(2021)
- OCC: An Automated End-to-End Machine Learning Optimizing Compiler for Computing-In-Memory(2022)
- SWAP: A Server-Scale Communication-Aware Chiplet-Based Manycore PIM Accelerator(2022)
- H2Learn: High-Efficiency Learning Accelerator for High-Accuracy Spiking Neural Networks(2022)
- ESSENCE: Exploiting Structured Stochastic Gradient Pruning for Endurance-Aware ReRAM-Based In-Memory Training Systems(2023)
- A Coordinated Model Pruning and Mapping Framework for RRAM-Based DNN Accelerators(2023)
- AccelTran: A Sparsity-Aware Accelerator for Dynamic Inference With Transformers(2023)
- SATA: Sparsity-Aware Training Accelerator for Spiking Neural Networks(2023)
- SpikeSim: An End-to-End Compute-in-Memory Hardware Evaluation Tool for Benchmarking Spiking Neural Networks(2023)
- TVLSI
- Deep Convolutional Neural Network Architecture With Reconfigurable Computation Patterns(2017)
- Benchmark of the Compute-in-Memory-Based DNN Accelerator With Area Constraint(2020)
- An Algorithm–Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers(2022)
- A 40-nm 1.89-pJ/SOP Scalable Convolutional Spiking Neural Network Learning Core With On-Chip Spatiotemporal Back-Propagation(2023)
- Others
- Steps toward Artificial Intelligence(1961)
- Analyzing CUDA workloads using a detailed GPU simulator(2009)
- DRAMSim2: A Cycle Accurate Memory System Simulator(2011)
- Unsupervised learning of digit recognition using spike-timing-dependent plasticity(2015)
- Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks(2016)
- Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1(2016)
- Ramulator: A Fast and Extensible DRAM Simulator(2016)
- Efficient Processing of Deep Neural Networks: A Tutorial and Survey(2017)
- FINN: A Framework for Fast, Scalable Binarized Neural Network Inference(2017)
- HBM (High Bandwidth Memory) DRAM Technology and Architecture(2017)
- CACTI 7: New Tools for Interconnect Exploration in Innovative Off-Chip Memories(2017)
- A Lightweight YOLOv2: A Binarized CNN with A Parallel Support Vector Regression for an FPGA(2018)
- NeuroSim: A Circuit-Level Macro Model for Benchmarking Neuro-Inspired Architectures in Online Learning(2018)
- Motivation for and Evaluation of the First Tensor Processing Unit(2018)
- NVIDIA Tensor Core Programmability, Performance & Precision(2018)
- Loihi: A Neuromorphic Manycore Processor with On-Chip Learning(2018)
- Training Deep Spiking Convolutional Neural Networks With STDP-Based Unsupervised Pre-training Followed by Supervised Fine-Tuning(2018)
- mRNA: Enabling Efficient Mapping Space Exploration for a Reconfiguration Neural Accelerator(2019)
- In-Memory Computing: Advances and Prospects(2019)
- Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices(2019)
- Device and materials requirements for neuromorphic computing(2019)
- Timeloop: A Systematic Approach to DNN Accelerator Evaluation(2019)
- Exploring Bit-Slice Sparsity in Deep Neural Networks for Efficient ReRAM-Based Deployment(2019)
- Modeling Deep Learning Accelerator Enabled GPUs(2019)
- MLP+NeuroSimV3.0: Improving On-chip Learning Performance with Device to Algorithm Optimizations(2019)
- Surrogate Gradient Learning in Spiking Neural Networks: Bringing the Power of Gradient-Based Optimization to Spiking Neural Networks(2019)
- SpykeTorch: Efficient Simulation of Convolutional Spiking Neural Networks With at Most One Spike per Neuron(2019)
- Bio-inspired digit recognition using reward-modulated spike-timing-dependent plasticity in deep convolutional networks(2019)
- Efficient spiking neural network training and inference with reduced precision memory and computing(2019)
- MNSIM 2.0: A Behavior-Level Modeling Tool for Memristor-based Neuromorphic Computing Systems(2020)
- An Architecture-Level Energy and Area Estimator for Processing-In-Memory Accelerator Designs(2020)
- Compressing Large-Scale Transformer-Based Models: A Case Study on BERT(2020)
- Hardware Accelerator for Multi-Head Attention and Position-Wise Feed-Forward in the Transformer(2020)
- HAT: Hardware-Aware Transformers for Efficient Natural Language Processing(2020)
- DRAMsim3: A Cycle-Accurate, Thermal-Capable DRAM Simulator(2020)
- OPTIMUS: OPTImized matrix MUltiplication Structure for Transformer neural network accelerator(2020)
- Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis(2020)
- A Systematic Methodology for Characterizing Scalability of DNN Accelerators using SCALE-Sim(2020)
- Compute-in-RRAM with Limited On-chip Resources(2021)
- Compute-in-Memory Chips for Deep Learning: Recent Trends and Prospects(2021)
- NeuroSim Simulator for Compute-in-Memory Hardware Accelerator: Validation and Benchmark(2021)
- Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-In-Memory Hardware(2021)
- SIAM: Chiplet-based Scalable In-Memory Acceleration with Mesh for Deep Neural Networks(2021)
- Wafer Level System Integration of the Fifth Generation CoWoS®-S with High Performance Si Interposer at 2500 mm2(2021)
- VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference(2021)
- Revisiting Batch Normalization for Training Low-Latency Deep Spiking Neural Networks From Scratch(2021)
- Q-SpiNN: A Framework for Quantizing Spiking Neural Networks(2021)
- SSTDP: Supervised Spike Timing Dependent Plasticity for Efficient Spiking Neural Network Training(2021)
- Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication(2022)
- Digital Versus Analog Artificial Intelligence Accelerators: Advances, trends, and emerging designs(2022)
- ULECGNet: An Ultra-Lightweight End-to-End ECG Classification Neural Network(2022)
- Design Methodology and Trends of SRAM-Based Compute-in-Memory Circuits(2022)
- Tolerating Noise Effects in Processing-in-Memory Systems for Neural Networks: A Hardware–Software Codesign Perspective(2022)
- On Building Efficient and Robust Neural Network Designs(2022)
- ReaLPrune: ReRAM Crossbar-aware Lottery Ticket Pruned CNNs(2022)
- From Macro To Microarchitecture: Reviews and Trends of SRAM-Based Compute-in-Memory Circuits(2023)
- Side-Channel Attack Analysis on In-Memory Computing Architectures(2023)
- An Ultra-Low Power TinyML System for Real-Time Visual Processing at Edge(2023)
- Hardware-aware Quantization/Mapping Strategies for Compute-in-Memory Accelerators(2023)
- Wafer-scale Computing: Advancements, Challenges, and Future Perspectives(2023)
- Neuro-Symbolic Computing: Advancements and Challenges in Hardware-Software Co-Design(2023)
- Block-Wise Mixed-Precision Quantization: Enabling High Efficiency for Practical ReRAM-based DNN Accelerators(2023)
- A Heterogeneous Chiplet Architecture for Accelerating End-to-End Transformer Models(2023)
- Ramulator 2.0: A Modern, Modular, and Extensible DRAM Simulator(2023)
- End-to-End Benchmarking of Chiplet-Based In-Memory Computing(2023)
- Performance Impact of Architectural Parameters on Chiplet-Based IMC Accelerators(2023)
- The Big Chip: Challenge, Model and Architecture(2023)
- The Rise and Potential of Large Language Model Based Agents: A Survey(2023)
- ReFloat: Low-Cost Floating-Point Processing in ReRAM for Accelerating Iterative Linear Solvers(2023)
- Knowledge Distillation between DNN and SNN for Intelligent Sensing Systems on Loihi Chip(2023)
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits(2024)
- FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs(2024)
- Weight Update Scheme for 1T1R Memristor Array Based Equilibrium Propagation(2024)
- AttentionLego: An Open-Source Building Block For Spatially-Scalable Large Language Model Accelerator With Processing-In-Memory Technology(2024)
- ChipNeMo: Domain-Adapted LLMs for Chip Design(2024)
- Unleashing Energy-Efficiency: Neural Architecture Search without Training for Spiking Neural Networks on Loihi Chip(2024)
- Quantization-Aware Training of Spiking Neural Networks for Energy-Efficient Spectrum Sensing on Loihi Chip(2024)
- Legendre-SNN on Loihi-2: Evaluation and Insights(2024)
- Are SNNs Truly Energy-efficient? — A Hardware Perspective(2024)
- Approximate Adder Tree Design with Sparsity-Aware Encoding and In-Memory Swapping for SRAM-based Digital Compute-In-Memory Macros(2024)
- Workload-Balanced Pruning for Sparse Spiking Neural Networks(2024)
- An all integer-based spiking neural network with dynamic threshold adaptation(2024)
Machine Learning
- Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures(2009)
- ImageNet Classification with Deep Convolutional Neural Networks(2012)
- Training deep neural networks with low precision multiplications(2015)
- Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding(2016)
- Attention Is All You Need(2017)
- Training and Inference with Integers in Deep Neural Networks(2018)
- Improving Language Understanding by Generative Pre-Training(2018)
- Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference(2018)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding(2019)
- Language Models are Unsupervised Multitask Learners(2019)
- Q8BERT: Quantized 8Bit BERT(2019)
- Machine Learning at Facebook: Understanding Inference at the Edge(2019)
- Generating Long Sequences with Sparse Transformers(2019)
- Fast Transformer Decoding: One Write-Head is All You Need(2019)
- HAQ: Hardware-Aware Automated Quantization With Mixed Precision(2019)
- Language Models are Few-Shot Learners(2020)
- Training high-performance and large-scale deep neural networks with full 8-bit integers(2020)
- Longformer: The Long-Document Transformer(2020)
- ETC: Encoding Long and Structured Inputs in Transformers(2020)
- Big Bird: Transformers for Longer Sequences(2020)
- Long Range Arena: A Benchmark for Efficient Transformers(2020)
- Low Latency Deep Learning Inference Model for Distributed Intelligent IoT Edge Clusters(2021)
- Memory-efficient Transformers via Top-k Attention(2021)
- I-BERT: Integer-only BERT Quantization(2021)
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness(2022)
- OPT: Open Pre-trained Transformer Language Models(2022)
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale(2022)
- Neural Architecture Search for Spiking Neural Networks(2022)
- Rate Coding or Direct Coding: Which One is Better for Accurate, Robust, and Energy-efficient Spiking Neural Networks?(2022)
- Exploring Lottery Ticket Hypothesis in Spiking Neural Networks(2022)
- NITI: Training Integer Neural Networks Using Integer-only Arithmetic(2022)
- PocketNN: Integer-only Training and Inference of Neural Networks via Direct Feedback Alignment and Pocket Activations in Pure C++(2022)
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU(2023)
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models(2023)
- Dynamic N:M Fine-Grained Structured Sparse Attention Mechanism(2023)
- Efficient Memory Management for Large Language Model Serving with PagedAttention(2023)
- Efficiently Scaling Transformer Inference(2023)
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints(2023)
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache(2023)
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models(2023)
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models(2023)
- Training Spiking Neural Networks Using Lessons From Deep Learning(2023)
- OneBit: Towards Extremely Low-bit Large Language Models(2024)
- Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond(2024)
- TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding(2024)
- Efficient Streaming Language Models with Attention Sinks(2024)
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization(2024)
- MiniCache: KV Cache Compression in Depth Dimension for Large Language Models(2024)
- Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference(2024)
- ThinK: Thinner Key Cache by Query-Driven Pruning(2024)
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM(2024)
- Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs(2024)
- SparQ Attention: Bandwidth-Efficient LLM Inference(2024)
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving(2024)
- AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration(2024)
- SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models(2024)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models(2024)
- Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving(2024)
- Addition is All You Need for Energy-efficient Language Models(2024)
- LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference(2024)
- KV Prediction for Improved Time to First Token(2024)
- NITRO-D: Native Integer-only Training of Deep Convolutional Neural Networks(2024)
RISC-V
- A 45nm 1.3GHz 16.7 double-precision GFLOPS/W RISC-V processor with vector accelerators(2014)
- TAIGA: A new RISC-V soft-processor framework enabling high performance CPU architectural features(2017)
- Framework and Tools for Undergraduates Designing RISC-V Processors on an FPGA in Computer Architecture Education(2019)
- Open-Source RISC-V Processor IP Cores for FPGAs — Overview and Evaluation(2019)
- GVSoC: A Highly Configurable, Fast and Accurate Full-Platform Simulator for RISC-V based IoT Processors(2021)
- RVfpga: Using a RISC-V Core Targeted to an FPGA in Computer Architecture Education(2021)
- Design and verification of RISC-V CPU based on HLS and UVM(2021)
- A comparative survey of open-source application-class RISC-V processor implementations(2021)
- A review of CNN accelerators for embedded systems based on RISC-V(2022)
- A Survey of RISC-V CPU for IoT Applications(2022)